Bayesian Logistic Regression from First Principles

Overview

This notebook demonstrates a from-scratch implementation of Bayesian logistic regression using NumPy, with posterior uncertainty quantified via the Laplace approximation. The implementation includes numerically stable likelihood computations, Newton optimization for MAP estimation, and Monte Carlo integration for predictive inference.

Key highlights:

- Complete derivation and implementation of the posterior (prior + likelihood)
- MAP estimation via Newton’s method with line search
- Laplace approximation for posterior uncertainty
- Posterior predictive distributions via Monte Carlo sampling
- Validation against PyMC’s NUTS sampler

If you have limited time: Jump to the Comparison with PyMC section to see how the from-scratch implementation matches a state-of-the-art probabilistic programming framework.

Future directions: Planned extensions include full posterior sampling via Hamiltonian Monte Carlo and hierarchical Bayesian models with varying intercepts.

Setup and synthetic data

I generate a well-separated binary classification dataset with two classes of 100 points each. The classes are centered at (-2, 0) and (2, 0) with unit variance. An intercept column is added to enable a bias term in the logistic model.

This controlled setting makes it straightforward to visualize decision boundaries and verify that the implementation behaves correctly. All randomness is seeded for full reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)
n0 = n1 = 100
X0 = rng.normal(size=(n0, 2)) + np.array([-2.0, 0.0])
X1 = rng.normal(size=(n1, 2)) + np.array([2.0, 0.0])
X = np.vstack([X0, X1])
X = np.c_[np.ones(len(X)), X]  # add intercept column
y = np.r_[np.zeros(n0), np.ones(n1)]
```
MAP estimation via Newton’s method
The maximum a posteriori (MAP) estimate maximizes the log posterior, which combines a Gaussian prior on the weights with the logistic likelihood:
\[\log p(w \mid D) \propto \log p(D \mid w) + \log p(w)\]
For logistic regression with a Gaussian prior, the log posterior is strictly concave, guaranteeing a unique global maximum. Newton’s method exploits this structure by using exact second-order curvature information (the Hessian) to converge rapidly—typically in fewer than 10 iterations.
The implementation includes a backtracking line search to ensure each step increases the log posterior, making the optimizer robust even with poor initializations.
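To make the update concrete, here is a minimal sketch of what a Newton MAP solver for this model might look like internally. The names (newton_map_sketch, sigmoid) are hypothetical, not the bayes_fp API, and for brevity the sketch takes full Newton steps rather than implementing the backtracking line search described above. The gradient and Hessian follow from the Gaussian prior \(w \sim \mathcal{N}(0, \sigma_0^2 I)\) and the logistic likelihood.

```python
import numpy as np

def sigmoid(z):
    # Numerically stable logistic function: avoids overflow for large |z|
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

def newton_map_sketch(X, y, sigma0=0.5, tol=1e-8, max_iter=50):
    """Hypothetical Newton loop for the logistic-regression MAP
    with a N(0, sigma0^2 I) prior (line search omitted)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iter):
        p = sigmoid(X @ w)
        # Gradient of the log posterior: likelihood term minus prior penalty
        g = X.T @ (y - p) - w / sigma0**2
        # Hessian: -X^T S X - I / sigma0^2, with S = diag(p * (1 - p))
        S = p * (1.0 - p)
        H = -(X.T * S) @ X - np.eye(d) / sigma0**2
        # Newton direction solves H * step = -g
        step = np.linalg.solve(H, -g)
        w = w + step  # a robust implementation would line-search here
        if np.linalg.norm(step) < tol:
            break
    return w, H
```

Because the log posterior is strictly concave, the Hessian is negative definite everywhere, so the Newton direction is always an ascent direction and the loop converges to the unique MAP.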
```python
from bayes_fp import newton_map_logistic

w_hat, H_hat, info = newton_map_logistic(X, y, sigma0=0.5, verbose=True)
```

```
iter 1 lp=-139.306810 ||step||=8.050e-01 t=1.00e+00
iter 2 lp=-48.057961 ||step||=5.374e-01 t=1.00e+00
iter 3 lp=-30.558037 ||step||=4.434e-01 t=1.00e+00
iter 4 lp=-25.852892 ||step||=2.197e-01 t=1.00e+00
iter 5 lp=-25.288095 ||step||=3.531e-02 t=1.00e+00
iter 6 lp=-25.277596 ||step||=7.188e-04 t=1.00e+00
iter 7 lp=-25.277592 ||step||=2.874e-07 t=1.00e+00
```
The function returns:

- w_hat: the MAP estimate
- H_hat: the Hessian at the MAP (needed for the Laplace approximation)
- info: a dictionary with convergence diagnostics
Posterior uncertainty via the Laplace approximation
A point estimate alone doesn’t capture parameter uncertainty. The Laplace approximation provides a Gaussian approximation to the posterior by taking a second-order Taylor expansion of the log posterior around the MAP:
\[p(w \mid D) \approx \mathcal{N}(w_{\text{MAP}}, \Sigma)\]
where \(\Sigma = -H^{-1}\) and \(H\) is the Hessian of the log posterior evaluated at the MAP. This approximation is exact for Gaussian posteriors and works well when the posterior is unimodal and approximately symmetric.
The key advantage is computational efficiency: we reuse the Hessian from optimization rather than running expensive sampling algorithms.
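Computing \(\Sigma = -H^{-1}\) might look like the sketch below (laplace_covariance_sketch is a hypothetical name, not the bayes_fp function). Since \(H\) is negative definite at the MAP, \(-H\) is positive definite, so a Cholesky factorization is a stable alternative to a direct matrix inverse.

```python
import numpy as np

def laplace_covariance_sketch(H):
    """Laplace covariance Sigma = -H^{-1}, where H is the
    (negative definite) Hessian of the log posterior at the MAP."""
    # Factor -H = L L^T, then invert via triangular solves
    L = np.linalg.cholesky(-H)
    Linv = np.linalg.solve(L, np.eye(L.shape[0]))
    # (-H)^{-1} = (L L^T)^{-1} = L^{-T} L^{-1}
    return Linv.T @ Linv
```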
```python
from bayes_fp import laplace_covariance

Sigma = laplace_covariance(H_hat)
```

Posterior predictive inference
Rather than making predictions with a single weight vector, Bayesian inference propagates uncertainty through to predictions via the posterior predictive distribution:
\[p(y_{\text{new}} = 1 \mid x_{\text{new}}, D) = \int \sigma(x_{\text{new}}^T w) \, p(w \mid D) \, dw\]
I approximate this integral via Monte Carlo:

1. Draw weight samples from the Laplace-approximated posterior
2. Compute predictions for each weight sample
3. Average the resulting probabilities
This produces smoother, better-calibrated predictions than using the MAP estimate alone, especially in regions far from the training data where uncertainty is higher.
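The three steps above can be sketched as follows (predictive_mc_sketch is a hypothetical name, not the bayes_fp function; it assumes Xnew already contains an intercept column):

```python
import numpy as np

def predictive_mc_sketch(Xnew, w_map, Sigma, nsamples=5000, seed=1):
    """Monte Carlo posterior predictive under the Laplace approximation."""
    rng = np.random.default_rng(seed)
    # 1. Draw weight samples from N(w_map, Sigma)
    W = rng.multivariate_normal(w_map, Sigma, size=nsamples)  # (nsamples, d)
    # 2. Class-1 probability for every test point and weight sample
    logits = Xnew @ W.T                                       # (n_new, nsamples)
    probs = 1.0 / (1.0 + np.exp(-logits))
    # 3. Average over posterior samples
    return probs.mean(axis=1)
```

Averaging the probabilities (rather than averaging the logits and then squashing) is what shrinks predictions toward 0.5 where the posterior over weights is diffuse.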
```python
from bayes_fp import predictive_laplace_mc

# Xnew: new inputs with an intercept column (defined in the full example below)
probabilities = predictive_laplace_mc(Xnew, w_hat, Sigma, nsamples=5000, seed=1)
```

Complete example with visualization
Here’s the full pipeline in action: data generation, MAP estimation, Laplace approximation, and posterior predictive inference at three test points.
```python
import numpy as np
from bayes_fp import newton_map_logistic, laplace_covariance, predictive_laplace_mc

rng = np.random.default_rng(0)
n0 = n1 = 100
X0 = rng.normal(size=(n0, 2)) + np.array([-2.0, 0.0])
X1 = rng.normal(size=(n1, 2)) + np.array([2.0, 0.0])
X = np.vstack([X0, X1])
X = np.c_[np.ones(len(X)), X]  # add intercept column
y = np.r_[np.zeros(n0), np.ones(n1)]

w_hat, H_hat, info = newton_map_logistic(X, y, sigma0=0.5, verbose=True)
Sigma = laplace_covariance(H_hat)

# Predictions at three test locations
Xnew = np.c_[np.ones(3), np.array([[-3, 0], [0, 0], [3, 0]])]
probabilities = predictive_laplace_mc(Xnew, w_hat, Sigma, nsamples=5000, seed=1)
print("Predicted probabilities at new locations:", probabilities)
print("MAP estimate:", w_hat)
```

```
iter 1 lp=-139.306810 ||step||=8.050e-01 t=1.00e+00
iter 2 lp=-48.057961 ||step||=5.374e-01 t=1.00e+00
iter 3 lp=-30.558037 ||step||=4.434e-01 t=1.00e+00
iter 4 lp=-25.852892 ||step||=2.197e-01 t=1.00e+00
iter 5 lp=-25.288095 ||step||=3.531e-02 t=1.00e+00
iter 6 lp=-25.277596 ||step||=7.188e-04 t=1.00e+00
iter 7 lp=-25.277592 ||step||=2.874e-07 t=1.00e+00
Predicted probabilities at new locations: [0.00362788 0.53624058 0.99728352]
MAP estimate: [ 0.15209971 2.02803149 -0.17091209]
```
The visualization below shows the posterior predictive probability surface across the input space. The decision boundary (0.5 contour) cleanly separates the two classes, and uncertainty increases smoothly in regions far from the training data.
```
Converged: True in 7 iterations
MAP: [ 0.1521 2.028 -0.1709]
```
Comparison with PyMC
To validate the implementation, I fit an identical model using PyMC with the No-U-Turn Sampler (NUTS), a state-of-the-art Hamiltonian Monte Carlo algorithm. The model uses the same Gaussian prior (\(w \sim \mathcal{N}(0, 0.5^2 I)\)) and Bernoulli likelihood.
For this well-behaved problem, the Laplace approximation is expected to work extremely well: the posterior is unimodal, concentrated, and close to Gaussian. Under these conditions, a second-order Taylor expansion captures nearly all the posterior mass, and the approximation \(p(w \mid D) \approx \mathcal{N}(w_{\text{MAP}}, \Sigma)\) is highly accurate.
```python
import pymc as pm
import arviz as az

with pm.Model() as pymc_logreg:
    sigma0 = 0.5
    w = pm.Normal("w", mu=0.0, sigma=sigma0, shape=X.shape[1])
    logit_p = pm.math.dot(X, w)
    y_obs = pm.Bernoulli("y_obs", logit_p=logit_p, observed=y)
    idata = pm.sample(
        draws=1000,
        tune=1000,
        chains=4,
        target_accept=0.9,
        random_seed=0,
        return_inferencedata=True,
    )

print(az.summary(idata, var_names=["w"], round_to=3))
```

The comparison reveals excellent agreement. The PyMC posterior mean is nearly identical to the MAP estimate (differences are negligible relative to coefficient magnitudes), and the decision boundaries overlap almost perfectly. This confirms that:
- The analytic gradient and Hessian implementations are correct
- The Newton solver reliably finds the global optimum
- The Laplace approximation faithfully captures the posterior for this dataset
In more challenging settings—such as weak class separation, strong collinearity, or highly skewed posteriors—the Laplace approximation can break down, and full MCMC sampling becomes necessary. However, for well-behaved problems like this one, the Laplace approach provides an accurate and computationally efficient alternative.
Summary
This notebook implements Bayesian logistic regression from the ground up, demonstrating proficiency with:
- Probabilistic modeling: Deriving and implementing Bayesian posteriors
- Numerical optimization: Newton’s method with line search for MAP estimation
- Uncertainty quantification: Laplace approximation and posterior predictive inference
- Scientific computing: Numerically stable implementations in NumPy
- Validation: Comparison against established probabilistic programming tools
The close agreement with PyMC validates the implementation and highlights when computationally efficient approximations can substitute for full sampling algorithms. The modular code structure in bayes_fp.py makes it straightforward to extend this work to more complex models and inference methods.